Analysis of Airbnb NYC

Vaibhavi Mulay

Airbnb is a paid community platform for renting and booking private accommodation founded in 2008. Airbnb allows individuals to rent all or part of their own home as extra accommodation. The site offers a search and booking platform between the person offering their accommodation and the vacationer who wishes to rent it. It covers more than 1.5 million advertisements in more than 34,000 cities and 191 countries. From creation, in August 2008, until June 2012, more than 10 million nights have been booked on Airbnb.

Since 2008, guests and hosts have used Airbnb to expand on traveling possibilities and present a more unique, personalized way of experiencing the world. Today, Airbnb became one of a kind service that is used and recognized by the whole world. Data analysis on millions of listings provided through Airbnb is a crucial factor for the company. These millions of listings generate a lot of data - data that can be analyzed and used for security, business decisions, understanding of customers' and providers' (hosts) behavior and performance on the platform, guiding marketing initiatives, implementation of innovative additional services and much more.

Problem Statement

Best prediction model for price i.e relationship between the price and other factors

Audience

Travelers and Hosts using Airbnb

Dataset

The dataset which we have used over here is New York City Airbnb Open Data. The dataset is available on kaggle. It has 16 columns and 48895 rows.

Below you will find the implementation of a few processes we have done for analysis. You can jump to the sections:

1. Data Cleaning
2. Exploratory Data Cleaning
3. Statistics and Machine Learning

Data Setup

First we will import the library such as numpy, scipy and matplotlib to manipulate, analyze and visualize our data. The second task for setting up our data set is by importing our dataset from a csv to our notebook. Here the csv file is converted into a set of data frames

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline
import seaborn as sns
import pandas_profiling

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from math import sqrt
from sklearn.metrics import r2_score
In [2]:
#using pandas library and 'read_csv' function to read BlackFriday csv file as file already formated for us from Kaggle
airbnb=pd.read_csv('AB_NYC_2019.csv')
#examing head of BlackFriday csv file 
airbnb.head(10)
Out[2]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0
5 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 2019-06-22 0.59 1 129
6 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 45 49 2017-10-05 0.40 1 0
7 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Hell's Kitchen 40.76489 -73.98493 Private room 79 2 430 2019-06-24 3.47 1 220
8 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80178 -73.96723 Private room 79 2 118 2017-07-21 0.99 1 0
9 5238 Cute & Cozy Lower East Side 1 bdrm 7549 Ben Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 1 160 2019-06-09 1.33 4 188
In [3]:
#profiling helps understanding the distribution of data
pandas_profiling.ProfileReport(airbnb)
C:\Users\Vaibhavi\Anaconda3\lib\site-packages\pandas_profiling\describe.py:392: FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
  variable_stats = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
Out[3]:

Overview

Dataset info

Number of variables 16
Number of observations 48895
Total Missing (%) 2.6%
Total size in memory 6.0 MiB
Average record size in memory 128.0 B

Variables types

Numeric 10
Categorical 6
Boolean 0
Date 0
Text (Unique) 0
Rejected 0
Unsupported 0

Warnings

Variables

id
Numeric

Distinct count 48895
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 19017000
Minimum 2539
Maximum 36487245
Zeros (%) 0.0%

Quantile statistics

Minimum 2539
5-th percentile 1222400
Q1 9471900
Median 19677000
Q3 29152000
95-th percentile 35259000
Maximum 36487245
Range 36484706
Interquartile range 19680000

Descriptive statistics

Standard deviation 10983000
Coef of variation 0.57754
Kurtosis -1.2277
Mean 19017000
MAD 9514800
Skewness -0.090257
Sum 929843218533
Variance 120630000000000
Memory size 382.1 KiB
Value Count Frequency (%)  
11667455 1 0.0%
 
7851219 1 0.0%
 
33138268 1 0.0%
 
1624665 1 0.0%
 
19387402 1 0.0%
 
18516103 1 0.0%
 
29802895 1 0.0%
 
19983575 1 0.0%
 
22078678 1 0.0%
 
33684693 1 0.0%
 
Other values (48885) 48885 100.0%
 

Minimum 5 values

Value Count Frequency (%)  
2539 1 0.0%
 
2595 1 0.0%
 
3647 1 0.0%
 
3831 1 0.0%
 
5022 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
36484665 1 0.0%
 
36485057 1 0.0%
 
36485431 1 0.0%
 
36485609 1 0.0%
 
36487245 1 0.0%
 

name
Categorical

Distinct count 47906
Unique (%) 98.0%
Missing (%) 0.0%
Missing (n) 16
Hillside Hotel
 
18
Home away from home
 
17
New york Multi-unit building
 
16
Other values (47902)
48828
(Missing)
 
16
Value Count Frequency (%)  
Hillside Hotel 18 0.0%
 
Home away from home 17 0.0%
 
New york Multi-unit building 16 0.0%
 
Brooklyn Apartment 12 0.0%
 
Private Room 11 0.0%
 
Loft Suite @ The Box House Hotel 11 0.0%
 
Artsy Private BR in Fort Greene Cumberland 10 0.0%
 
Private room 10 0.0%
 
Beautiful Brooklyn Brownstone 8 0.0%
 
Private room in Williamsburg 8 0.0%
 
Other values (47895) 48758 99.7%
 
(Missing) 16 0.0%
 

host_id
Numeric

Distinct count 37457
Unique (%) 76.6%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 67620000
Minimum 2438
Maximum 274321313
Zeros (%) 0.0%

Quantile statistics

Minimum 2438
5-th percentile 815560
Q1 7822000
Median 30794000
Q3 107430000
95-th percentile 241760000
Maximum 274321313
Range 274318875
Interquartile range 99612000

Descriptive statistics

Standard deviation 78611000
Coef of variation 1.1625
Kurtosis 0.16911
Mean 67620000
MAD 64347000
Skewness 1.2062
Sum 3306280420566
Variance 6179700000000000
Memory size 382.1 KiB
Value Count Frequency (%)  
219517861 327 0.7%
 
107434423 232 0.5%
 
30283594 121 0.2%
 
137358866 103 0.2%
 
12243051 96 0.2%
 
16098958 96 0.2%
 
61391963 91 0.2%
 
22541573 87 0.2%
 
200380610 65 0.1%
 
7503643 52 0.1%
 
Other values (37447) 47625 97.4%
 

Minimum 5 values

Value Count Frequency (%)  
2438 1 0.0%
 
2571 1 0.0%
 
2787 6 0.0%
 
2845 2 0.0%
 
2868 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
274273284 1 0.0%
 
274298453 1 0.0%
 
274307600 1 0.0%
 
274311461 1 0.0%
 
274321313 1 0.0%
 

host_name
Categorical

Distinct count 11453
Unique (%) 23.4%
Missing (%) 0.0%
Missing (n) 21
Michael
 
417
David
 
403
Sonder (NYC)
 
327
Other values (11449)
47727
Value Count Frequency (%)  
Michael 417 0.9%
 
David 403 0.8%
 
Sonder (NYC) 327 0.7%
 
John 294 0.6%
 
Alex 279 0.6%
 
Blueground 232 0.5%
 
Sarah 227 0.5%
 
Daniel 226 0.5%
 
Jessica 205 0.4%
 
Maria 204 0.4%
 
Other values (11442) 46060 94.2%
 

neighbourhood_group
Categorical

Distinct count 5
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Manhattan
21661
Brooklyn
20104
Queens
5666
Other values (2)
 
1464
Value Count Frequency (%)  
Manhattan 21661 44.3%
 
Brooklyn 20104 41.1%
 
Queens 5666 11.6%
 
Bronx 1091 2.2%
 
Staten Island 373 0.8%
 

neighbourhood
Categorical

Distinct count 221
Unique (%) 0.5%
Missing (%) 0.0%
Missing (n) 0
Williamsburg
 
3920
Bedford-Stuyvesant
 
3714
Harlem
 
2658
Other values (218)
38603
Value Count Frequency (%)  
Williamsburg 3920 8.0%
 
Bedford-Stuyvesant 3714 7.6%
 
Harlem 2658 5.4%
 
Bushwick 2465 5.0%
 
Upper West Side 1971 4.0%
 
Hell's Kitchen 1958 4.0%
 
East Village 1853 3.8%
 
Upper East Side 1798 3.7%
 
Crown Heights 1564 3.2%
 
Midtown 1545 3.2%
 
Other values (211) 25449 52.0%
 

latitude
Numeric

Distinct count 19048
Unique (%) 39.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 40.729
Minimum 40.5
Maximum 40.913
Zeros (%) 0.0%

Quantile statistics

Minimum 40.5
5-th percentile 40.646
Q1 40.69
Median 40.723
Q3 40.763
95-th percentile 40.826
Maximum 40.913
Range 0.41327
Interquartile range 0.073015

Descriptive statistics

Standard deviation 0.05453
Coef of variation 0.0013389
Kurtosis 0.14884
Mean 40.729
MAD 0.04326
Skewness 0.23717
Sum 1991400
Variance 0.0029735
Memory size 382.1 KiB
Value Count Frequency (%)  
40.71813 18 0.0%
 
40.68634 13 0.0%
 
40.694140000000004 13 0.0%
 
40.68444 13 0.0%
 
40.71171 12 0.0%
 
40.68537 12 0.0%
 
40.76189 12 0.0%
 
40.76125 12 0.0%
 
40.71353 12 0.0%
 
40.690540000000006 11 0.0%
 
Other values (19038) 48767 99.7%
 

Minimum 5 values

Value Count Frequency (%)  
40.499790000000004 1 0.0%
 
40.506409999999995 1 0.0%
 
40.50708 1 0.0%
 
40.50868 1 0.0%
 
40.50873 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
40.90804 1 0.0%
 
40.91167 1 0.0%
 
40.91169 1 0.0%
 
40.91234 1 0.0%
 
40.913059999999994 1 0.0%
 

longitude
Numeric

Distinct count 14718
Unique (%) 30.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean -73.952
Minimum -74.244
Maximum -73.713
Zeros (%) 0.0%

Quantile statistics

Minimum -74.244
5-th percentile -74.004
Q1 -73.983
Median -73.956
Q3 -73.936
95-th percentile -73.866
Maximum -73.713
Range 0.53143
Interquartile range 0.046795

Descriptive statistics

Standard deviation 0.046157
Coef of variation -0.00062414
Kurtosis 5.0216
Mean -73.952
MAD 0.031613
Skewness 1.2842
Sum -3615900
Variance 0.0021304
Memory size 382.1 KiB
Value Count Frequency (%)  
-73.95676999999999 18 0.0%
 
-73.95427 18 0.0%
 
-73.95405 17 0.0%
 
-73.95136 16 0.0%
 
-73.94791 16 0.0%
 
-73.9506 16 0.0%
 
-73.95331999999999 16 0.0%
 
-73.95725 15 0.0%
 
-73.98589 15 0.0%
 
-73.95669000000001 15 0.0%
 
Other values (14708) 48733 99.7%
 

Minimum 5 values

Value Count Frequency (%)  
-74.24441999999999 1 0.0%
 
-74.24285 1 0.0%
 
-74.24084 1 0.0%
 
-74.23986 1 0.0%
 
-74.23914 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
-73.71928 1 0.0%
 
-73.71829 1 0.0%
 
-73.71795 1 0.0%
 
-73.7169 1 0.0%
 
-73.71299 1 0.0%
 

room_type
Categorical

Distinct count 3
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Entire home/apt
25409
Private room
22326
Shared room
 
1160
Value Count Frequency (%)  
Entire home/apt 25409 52.0%
 
Private room 22326 45.7%
 
Shared room 1160 2.4%
 

price
Numeric

Distinct count 674
Unique (%) 1.4%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 152.72
Minimum 0
Maximum 10000
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 40
Q1 69
Median 106
Q3 175
95-th percentile 355
Maximum 10000
Range 10000
Interquartile range 106

Descriptive statistics

Standard deviation 240.15
Coef of variation 1.5725
Kurtosis 585.67
Mean 152.72
MAD 92.451
Skewness 19.119
Sum 7467278
Variance 57674
Memory size 382.1 KiB
Value Count Frequency (%)  
100 2051 4.2%
 
150 2047 4.2%
 
50 1534 3.1%
 
60 1458 3.0%
 
200 1401 2.9%
 
75 1370 2.8%
 
80 1272 2.6%
 
65 1190 2.4%
 
70 1170 2.4%
 
120 1130 2.3%
 
Other values (664) 34272 70.1%
 

Minimum 5 values

Value Count Frequency (%)  
0 11 0.0%
 
10 17 0.0%
 
11 3 0.0%
 
12 4 0.0%
 
13 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
7703 1 0.0%
 
8000 1 0.0%
 
8500 1 0.0%
 
9999 3 0.0%
 
10000 3 0.0%
 

minimum_nights
Numeric

Distinct count 109
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 7.03
Minimum 1
Maximum 1250
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 1
Q1 1
Median 3
Q3 5
95-th percentile 30
Maximum 1250
Range 1249
Interquartile range 4

Descriptive statistics

Standard deviation 20.511
Coef of variation 2.9176
Kurtosis 854.07
Mean 7.03
MAD 7.5578
Skewness 21.827
Sum 343730
Variance 420.68
Memory size 382.1 KiB
Value Count Frequency (%)  
1 12720 26.0%
 
2 11696 23.9%
 
3 7999 16.4%
 
30 3760 7.7%
 
4 3303 6.8%
 
5 3034 6.2%
 
7 2058 4.2%
 
6 752 1.5%
 
14 562 1.1%
 
10 483 1.0%
 
Other values (99) 2528 5.2%
 

Minimum 5 values

Value Count Frequency (%)  
1 12720 26.0%
 
2 11696 23.9%
 
3 7999 16.4%
 
4 3303 6.8%
 
5 3034 6.2%
 

Maximum 5 values

Value Count Frequency (%)  
480 1 0.0%
 
500 5 0.0%
 
999 3 0.0%
 
1000 1 0.0%
 
1250 1 0.0%
 

number_of_reviews
Numeric

Distinct count 394
Unique (%) 0.8%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 23.274
Minimum 0
Maximum 629
Zeros (%) 20.6%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 1
Median 5
Q3 24
95-th percentile 114
Maximum 629
Range 629
Interquartile range 23

Descriptive statistics

Standard deviation 44.551
Coef of variation 1.9141
Kurtosis 19.53
Mean 23.274
MAD 27.566
Skewness 3.6906
Sum 1138005
Variance 1984.8
Memory size 382.1 KiB
Value Count Frequency (%)  
0 10052 20.6%
 
1 5244 10.7%
 
2 3465 7.1%
 
3 2520 5.2%
 
4 1994 4.1%
 
5 1618 3.3%
 
6 1357 2.8%
 
7 1179 2.4%
 
8 1127 2.3%
 
9 964 2.0%
 
Other values (384) 19375 39.6%
 

Minimum 5 values

Value Count Frequency (%)  
0 10052 20.6%
 
1 5244 10.7%
 
2 3465 7.1%
 
3 2520 5.2%
 
4 1994 4.1%
 

Maximum 5 values

Value Count Frequency (%)  
576 1 0.0%
 
594 1 0.0%
 
597 1 0.0%
 
607 1 0.0%
 
629 1 0.0%
 

last_review
Categorical

Distinct count 1765
Unique (%) 3.6%
Missing (%) 20.6%
Missing (n) 10052
2019-06-23
 
1413
2019-07-01
 
1359
2019-06-30
 
1341
Other values (1761)
34730
(Missing)
10052
Value Count Frequency (%)  
2019-06-23 1413 2.9%
 
2019-07-01 1359 2.8%
 
2019-06-30 1341 2.7%
 
2019-06-24 875 1.8%
 
2019-07-07 718 1.5%
 
2019-07-02 658 1.3%
 
2019-06-22 655 1.3%
 
2019-06-16 601 1.2%
 
2019-07-05 580 1.2%
 
2019-07-06 565 1.2%
 
Other values (1754) 30078 61.5%
 
(Missing) 10052 20.6%
 

reviews_per_month
Numeric

Distinct count 938
Unique (%) 1.9%
Missing (%) 20.6%
Missing (n) 10052
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.3732
Minimum 0.01
Maximum 58.5
Zeros (%) 0.0%

Quantile statistics

Minimum 0.01
5-th percentile 0.04
Q1 0.19
Median 0.72
Q3 2.02
95-th percentile 4.64
Maximum 58.5
Range 58.49
Interquartile range 1.83

Descriptive statistics

Standard deviation 1.6804
Coef of variation 1.2237
Kurtosis 42.493
Mean 1.3732
MAD 1.2389
Skewness 3.1302
Sum 53340
Variance 2.8239
Memory size 382.1 KiB
Value Count Frequency (%)  
0.02 919 1.9%
 
0.05 893 1.8%
 
1.0 893 1.8%
 
0.03 804 1.6%
 
0.16 667 1.4%
 
0.04 655 1.3%
 
0.08 596 1.2%
 
0.09 593 1.2%
 
0.06 579 1.2%
 
0.11 539 1.1%
 
Other values (927) 31705 64.8%
 
(Missing) 10052 20.6%
 

Minimum 5 values

Value Count Frequency (%)  
0.01 42 0.1%
 
0.02 919 1.9%
 
0.03 804 1.6%
 
0.04 655 1.3%
 
0.05 893 1.8%
 

Maximum 5 values

Value Count Frequency (%)  
17.82 1 0.0%
 
19.75 1 0.0%
 
20.94 1 0.0%
 
27.95 1 0.0%
 
58.5 1 0.0%
 

calculated_host_listings_count
Numeric

Distinct count 47
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 7.144
Minimum 1
Maximum 327
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 1
Q1 1
Median 1
Q3 2
95-th percentile 15
Maximum 327
Range 326
Interquartile range 1

Descriptive statistics

Standard deviation 32.953
Coef of variation 4.6126
Kurtosis 67.551
Mean 7.144
MAD 10.291
Skewness 7.9332
Sum 349305
Variance 1085.9
Memory size 382.1 KiB
Value Count Frequency (%)  
1 32303 66.1%
 
2 6658 13.6%
 
3 2853 5.8%
 
4 1440 2.9%
 
5 845 1.7%
 
6 570 1.2%
 
8 416 0.9%
 
7 399 0.8%
 
327 327 0.7%
 
9 234 0.5%
 
Other values (37) 2850 5.8%
 

Minimum 5 values

Value Count Frequency (%)  
1 32303 66.1%
 
2 6658 13.6%
 
3 2853 5.8%
 
4 1440 2.9%
 
5 845 1.7%
 

Maximum 5 values

Value Count Frequency (%)  
96 192 0.4%
 
103 103 0.2%
 
121 121 0.2%
 
232 232 0.5%
 
327 327 0.7%
 

availability_365
Numeric

Distinct count 366
Unique (%) 0.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 112.78
Minimum 0
Maximum 365
Zeros (%) 35.9%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 45
Q3 227
95-th percentile 359
Maximum 365
Range 365
Interquartile range 227

Descriptive statistics

Standard deviation 131.62
Coef of variation 1.1671
Kurtosis -0.99753
Mean 112.78
MAD 116.33
Skewness 0.76341
Sum 5514443
Variance 17324
Memory size 382.1 KiB
Value Count Frequency (%)  
0 17533 35.9%
 
365 1295 2.6%
 
364 491 1.0%
 
1 408 0.8%
 
89 361 0.7%
 
5 340 0.7%
 
3 306 0.6%
 
179 301 0.6%
 
90 290 0.6%
 
2 270 0.6%
 
Other values (356) 27300 55.8%
 

Minimum 5 values

Value Count Frequency (%)  
0 17533 35.9%
 
1 408 0.8%
 
2 270 0.6%
 
3 306 0.6%
 
4 233 0.5%
 

Maximum 5 values

Value Count Frequency (%)  
361 111 0.2%
 
362 166 0.3%
 
363 239 0.5%
 
364 491 1.0%
 
365 1295 2.6%
 

Correlations

Sample

id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0

Data Cleaning

The first step we will do here is cleaning our data. Here we will do operations such as getting our data into a standard format, handling null values, removing unneccesary columns or values etc.

In [4]:
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
id                                48895 non-null int64
name                              48879 non-null object
host_id                           48895 non-null int64
host_name                         48874 non-null object
neighbourhood_group               48895 non-null object
neighbourhood                     48895 non-null object
latitude                          48895 non-null float64
longitude                         48895 non-null float64
room_type                         48895 non-null object
price                             48895 non-null int64
minimum_nights                    48895 non-null int64
number_of_reviews                 48895 non-null int64
last_review                       38843 non-null object
reviews_per_month                 38843 non-null float64
calculated_host_listings_count    48895 non-null int64
availability_365                  48895 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
In [5]:
total = airbnb.isnull().sum().sort_values(ascending=False)
percent = ((airbnb.isnull().sum())*100)/airbnb.isnull().count().sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total','Percent'], sort=False).sort_values('Total', ascending=False)
missing_data.head(40)
Out[5]:
Total Percent
reviews_per_month 10052 20.558339
last_review 10052 20.558339
host_name 21 0.042949
name 16 0.032723
availability_365 0 0.000000
calculated_host_listings_count 0 0.000000
number_of_reviews 0 0.000000
minimum_nights 0 0.000000
price 0 0.000000
room_type 0 0.000000
longitude 0 0.000000
latitude 0 0.000000
neighbourhood 0 0.000000
neighbourhood_group 0 0.000000
host_id 0 0.000000
id 0 0.000000
In [6]:
airbnb['adjusted_price'] = airbnb.price/airbnb.minimum_nights

airbnb.head()
Out[6]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365 149.0
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355 225.0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365 50.0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194 89.0
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0 8.0
In [7]:
airbnb["last_review"] = pd.to_datetime(airbnb.last_review)

airbnb.head()
Out[7]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365 149.0
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355 225.0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaT NaN 1 365 50.0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194 89.0
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0 8.0
In [8]:
airbnb["reviews_per_month"] = airbnb["reviews_per_month"].fillna(airbnb["reviews_per_month"].mean())
airbnb.head()
Out[8]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.210000 6 365 149.0
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.380000 2 355 225.0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaT 1.373221 1 365 50.0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.640000 1 194 89.0
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.100000 1 0 8.0
In [9]:
airbnb.last_review.fillna(method="ffill", inplace=True)

airbnb.head()
Out[9]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.210000 6 365 149.0
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.380000 2 355 225.0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 2019-05-21 1.373221 1 365 50.0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.640000 1 194 89.0
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.100000 1 0 8.0
In [10]:
for column in airbnb.columns:
    if airbnb[column].isnull().sum() != 0:
        print("=======================================================")
        print(f"{column} ==> Missing Values : {airbnb[column].isnull().sum()}, dtypes : {airbnb[column].dtypes}")
        
for column in airbnb.columns:
    if airbnb[column].isnull().sum() != 0:
        airbnb[column] = airbnb[column].fillna(airbnb[column].mode()[0])
        
airbnb.isnull().sum()
=======================================================
name ==> Missing Values : 16, dtypes : object
=======================================================
host_name ==> Missing Values : 21, dtypes : object
Out[10]:
id                                0
name                              0
host_id                           0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
last_review                       0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
adjusted_price                    0
dtype: int64
In [11]:
pd.options.display.float_format = "{:.2f}".format
airbnb.describe()
Out[11]:
id host_id latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365 adjusted_price
count 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00 48895.00
mean 19017143.24 67620010.65 40.73 -73.95 152.72 7.03 23.27 1.37 7.14 112.78 70.17
std 10983108.39 78610967.03 0.05 0.05 240.15 20.51 44.55 1.50 32.95 131.62 157.62
min 2539.00 2438.00 40.50 -74.24 0.00 1.00 0.00 0.01 1.00 0.00 0.00
25% 9471945.00 7822033.00 40.69 -73.98 69.00 1.00 1.00 0.28 1.00 0.00 20.00
50% 19677284.00 30793816.00 40.72 -73.96 106.00 3.00 5.00 1.22 1.00 45.00 44.50
75% 29152178.50 107434423.00 40.76 -73.94 175.00 5.00 24.00 1.58 2.00 227.00 81.50
max 36487245.00 274321313.00 40.91 -73.71 10000.00 1250.00 629.00 58.50 327.00 365.00 8000.00
In [12]:
# Drop ["id", "host_name"] because it is insignificant and also for ethical reasons.
airbnb.drop(["id", "host_name"], axis="columns", inplace=True)
airbnb.head()
Out[12]:
name host_id neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
0 Clean & quiet apt home by the park 2787 Brooklyn Kensington 40.65 -73.97 Private room 149 1 9 2018-10-19 0.21 6 365 149.00
1 Skylit Midtown Castle 2845 Manhattan Midtown 40.75 -73.98 Entire home/apt 225 1 45 2019-05-21 0.38 2 355 225.00
2 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Manhattan Harlem 40.81 -73.94 Private room 150 3 0 2019-05-21 1.37 1 365 50.00
3 Cozy Entire Floor of Brownstone 4869 Brooklyn Clinton Hill 40.69 -73.96 Entire home/apt 89 1 270 2019-07-05 4.64 1 194 89.00
4 Entire Apt: Spacious Studio/Loft by central park 7192 Manhattan East Harlem 40.80 -73.94 Entire home/apt 80 10 9 2018-11-19 0.10 1 0 8.00
In [13]:
categorical_col = []
for column in airbnb.columns:
    if len(airbnb[column].unique()) <= 10:
        print("===============================================================================")
        print(f"{column} : {airbnb[column].unique()}")
        categorical_col.append(column)
===============================================================================
neighbourhood_group : ['Brooklyn' 'Manhattan' 'Queens' 'Staten Island' 'Bronx']
===============================================================================
room_type : ['Private room' 'Entire home/apt' 'Shared room']

Exploratory Data Analysis

Exporatory Data analysis or EDA is an approach to analyzing your dataset to summarize their characteristics often with visual methods. For the above given dataset we have explored the attributes using appropriate graphical model. This will help us to understand the nature of our data, its behavoir and so on. In the below sections we will analyze our data that with try to answers quesion like why, where and how the factors affect the airbnb ratings and prices.

In [14]:
import plotly.graph_objs as go

#Access token from Plotly
mapbox_access_token = 'pk.eyJ1Ijoia3Jwb3BraW4iLCJhIjoiY2pzcXN1eDBuMGZrNjQ5cnp1bzViZWJidiJ9.ReBalb28P1FCTWhmYBnCtA'

#Prepare data for Plotly
data = [
    go.Scattermapbox(
        lat=airbnb.latitude,
        lon=airbnb.longitude,
        mode='markers',
        text=airbnb[['neighbourhood_group','number_of_reviews','adjusted_price']],
        marker=dict(
            size=7,
            color=airbnb.adjusted_price,
            colorscale='RdBu',
            reversescale=True,
            colorbar=dict(
                title='Adjusted Price'
            )
        ),
    )
]
In [15]:
#Prepare layout for Plotly
layout = go.Layout(
    autosize=True,
    hovermode='closest',
    title='NYC Airbnb ',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.721319,
            lon=-73.987130
        ),
        pitch=0,
        zoom=11
    ),
)
In [16]:
from plotly.offline import init_notebook_mode, iplot
#Create map using Plotly
fig = dict(data=data, layout=layout)
iplot(fig, filename='NYC Airbnb')
In [17]:
airbnb[airbnb.adjusted_price > 5000]
Out[17]:
name host_id neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 adjusted_price
3720 SuperBowl Penthouse Loft 3,000 sqft 1483320 Manhattan Little Italy 40.72 -74.00 Entire home/apt 5250 1 0 2016-03-13 1.37 1 0 5250.00
3774 SUPER BOWL Brooklyn Duplex Apt!! 11598359 Brooklyn Clinton Hill 40.69 -73.96 Entire home/apt 6500 1 0 2018-07-14 1.37 1 0 6500.00
4377 Film Location 1177497 Brooklyn Clinton Hill 40.69 -73.97 Entire home/apt 8000 1 1 2016-09-15 0.03 11 365 8000.00
15560 Luxury townhouse Greenwich Village 66240032 Manhattan Greenwich Village 40.73 -74.00 Entire home/apt 6000 1 0 2017-12-30 1.37 1 0 6000.00
29662 East 72nd Townhouse by (Hidden by Airbnb) 156158778 Manhattan Upper East Side 40.77 -73.96 Entire home/apt 7703 1 0 2018-09-21 1.37 12 146 7703.00
29664 Park Avenue Mansion by (Hidden by Airbnb) 156158778 Manhattan Upper East Side 40.79 -73.95 Entire home/apt 6419 1 0 2018-09-21 1.37 12 45 6419.00
42523 70' Luxury MotorYacht on the Hudson 7407743 Manhattan Battery Park City 40.71 -74.02 Entire home/apt 7500 1 0 2019-05-31 1.37 1 364 7500.00
44034 3000 sq ft daylight photo studio 3750764 Manhattan Chelsea 40.75 -74.00 Entire home/apt 6800 1 0 2019-06-15 1.37 6 364 6800.00
45666 Gem of east Flatbush 262534951 Brooklyn East Flatbush 40.66 -73.92 Private room 7500 1 8 2019-07-07 6.15 2 179 7500.00
In [18]:
import plotly.express as px

## Setting up the Visualization..
fig = px.scatter_mapbox(airbnb, 
                        hover_data = ['price','minimum_nights','room_type'],
                        hover_name = 'neighbourhood',
                        lat="latitude", 
                        lon="longitude", 
                        color="neighbourhood_group", 
                        size="price",
#                         color_continuous_scale=px.colors.cyclical.IceFire, 
                        size_max=30, 
                        opacity = .70,
                        zoom=10,
                       )
# "open-street-map", "carto-positron", "carto-darkmatter", "stamen-terrain", "stamen-toner" or 
# "stamen-watercolor" yeild maps composed of raster tiles from various public tile servers which do 
# not require signups or access tokens
# fig.update_layout(mapbox_style="carto-positron", 
#                  )
fig.layout.mapbox.style = 'stamen-terrain'
fig.update_layout(title_text = 'Airbnb by Borough in NYC<br>(Click legend to toggle borough)', height = 800)
fig.show()

The first graph is about the relationship between price and room type. The Shared room price is always lower than 2000 dollars. On the other hand, the private room and the entire home have the highest price in some.

In [19]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15,12))
sns.scatterplot(x='room_type', y='price', data=airbnb)

plt.xlabel("Room Type", size=13)
plt.ylabel("Price", size=13)
plt.title("Room Type vs Price",size=15, weight='bold')
Out[19]:
Text(0.5, 1.0, 'Room Type vs Price')

Below graph shows details about price and room type based on neighborhood group. The highest price of Private Room and Entire Home/Apt is in the same area which is Manhattan. Also, Brooklyn has very-high prices both in Private Room and Entire Home/Apt. On the other hand, shared room's highest price is in the Queens area and also in Staten Island.

In [20]:
plt.figure(figsize=(20,15))
sns.scatterplot(x="room_type", y="price",
            hue="neighbourhood_group", size="neighbourhood_group",
            sizes=(50, 200), palette="Dark2", data=airbnb)

plt.xlabel("Room Type", size=13)
plt.ylabel("Price", size=13)
plt.title("Room Type vs Price vs Neighbourhood Group",size=15, weight='bold')
Out[20]:
Text(0.5, 1.0, 'Room Type vs Price vs Neighbourhood Group')
In [21]:
f,ax=plt.subplots(1,2,figsize=(18,8))
airbnb['neighbourhood_group'].value_counts().plot.pie(explode=[0,0.05,0,0,0],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Share of Neighborhood')
ax[0].set_ylabel('Neighborhood Share')
sns.countplot('neighbourhood_group',data=airbnb,ax=ax[1],order=airbnb['neighbourhood_group'].value_counts().index)
ax[1].set_title('Share of Neighborhood')
plt.show()
In [22]:
plt.figure(figsize=(10,6))
sns.distplot(airbnb[airbnb.neighbourhood_group=='Manhattan'].adjusted_price,color='maroon',hist=False,label='Manhattan')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Brooklyn'].adjusted_price,color='black',hist=False,label='Brooklyn')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Queens'].adjusted_price,color='green',hist=False,label='Queens')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Staten Island'].adjusted_price,color='blue',hist=False,label='Staten Island')
sns.distplot(airbnb[airbnb.neighbourhood_group=='Long Island'].adjusted_price,color='lavender',hist=False,label='Long Island')
plt.title('Borough wise price destribution for adjusted_price<1000')
plt.xlim(0,1000)
plt.show()
In [23]:
#we can see from our statistical table that we have some extreme values, therefore we need to remove them for the sake of a better visualization

#creating a sub-dataframe with no extreme values / less than 500
sub_6=airbnb[airbnb.adjusted_price < 500]
#using violinplot to showcase density and distribtuion of prices 
viz_2=sns.violinplot(data=sub_6, x='neighbourhood_group', y='adjusted_price')
viz_2.set_title('Density and distribution of prices for each neighberhood_group')
Out[23]:
Text(0.5, 1.0, 'Density and distribution of prices for each neighberhood_group')

Great, with a statistical table and a violin plot we can definitely observe a couple of things about distribution of prices for Airbnb in NYC boroughs. First, we can state that Manhattan has the highest range of prices for the listings with 150 dollar price as average observation, followed by Brooklyn with 90 dollar per night. Queens and Staten Island appear to have very similar distributions, Bronx is the cheapest of them all. This distribution and density of prices were completely expected; for example, as it is no secret that Manhattan is one of the most expensive places in the world to live in, where Bronx on other hand appears to have lower standards of living.

In [24]:
from scipy.stats import norm

plt.figure(figsize=(10,10))
sns.distplot(airbnb['price'], fit=norm)
plt.title("Price Distribution Plot",size=15, weight='bold')
Out[24]:
Text(0.5, 1.0, 'Price Distribution Plot')

The above distribution graph shows that there is a right-skewed distribution on price. This means there is a positive skewness. Log transformation will be used to make this feature less skewed. This will help to make easier interpretation and better statistical analysis

Since division by zero is a problem, log+1 transformation would be better.

In [25]:
airbnb['price_log'] = np.log(airbnb.price+1)

With help of log transformation, now, price feature have normal distribution.

In [26]:
plt.figure(figsize=(12,10))
sns.distplot(airbnb['price_log'], fit=norm)
plt.title("Log-Price Distribution Plot",size=15, weight='bold')
Out[26]:
Text(0.5, 1.0, 'Log-Price Distribution Plot')

In below graph, the good fit indicates that normality is a reasonable approximation.

In [27]:
from scipy import stats

plt.figure(figsize=(7,7))
stats.probplot(airbnb['price_log'], plot=plt)
plt.show()
In [28]:
airbnb['neighbourhood_group']= airbnb['neighbourhood_group'].astype("category").cat.codes
airbnb['neighbourhood'] = airbnb['neighbourhood'].astype("category").cat.codes
airbnb['room_type'] = airbnb['room_type'].astype("category").cat.codes
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
name                              48895 non-null object
host_id                           48895 non-null int64
neighbourhood_group               48895 non-null int8
neighbourhood                     48895 non-null int16
latitude                          48895 non-null float64
longitude                         48895 non-null float64
room_type                         48895 non-null int8
price                             48895 non-null int64
minimum_nights                    48895 non-null int64
number_of_reviews                 48895 non-null int64
last_review                       48895 non-null datetime64[ns]
reviews_per_month                 48895 non-null float64
calculated_host_listings_count    48895 non-null int64
availability_365                  48895 non-null int64
adjusted_price                    48895 non-null float64
price_log                         48895 non-null float64
dtypes: datetime64[ns](1), float64(5), int16(1), int64(6), int8(2), object(1)
memory usage: 5.0+ MB
In [29]:
airbnb_model = airbnb.drop(columns=['name','host_id', 
                                   'last_review','price','adjusted_price'])

plt.figure(figsize=(15,12))
palette = sns.diverging_palette(20, 220, n=256)
corr=airbnb_model.corr(method='pearson')
sns.heatmap(corr, annot=True, fmt=".2f", cmap=palette, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}).set(ylim=(11, 0))
plt.title("Correlation Matrix",size=15, weight='bold')
Out[29]:
Text(0.5, 1, 'Correlation Matrix')

The correlation table shows that there is no strong relationship between price and other features. This indicates no feature needed to be taken out of data.

Statistics and Machine Learning

Residual Plots

Residual Plot is strong method to detect outliers, non-linear data and detecting data for regression models. The below charts show the residual plots for each feature with the price.

An ideal Residual Plot, the red line would be horizontal. Based on the below charts, most features are non-linear. On the other hand, there are not many outliers in each feature. This result led to underfitting. Underfitting can occur when input features do not have a strong relationship to target variables or over-regularized. For avoiding underfitting new data features can be added or regularization weight could be reduced.

In this kernel, since the input feature data could not be increased, Regularized Linear Models will be used for regularization and polynomial transformation will be made to avoid underfitting.

In [30]:
airbnb_model_x, airbnb_model_y = airbnb_model.iloc[:,:-1], airbnb_model.iloc[:,-1]
In [31]:
f, axes = plt.subplots(5, 2, figsize=(15, 20))
sns.residplot(airbnb_model_x.iloc[:,0],airbnb_model_y, lowess=True, ax=axes[0, 0], 
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
sns.residplot(airbnb_model_x.iloc[:,1],airbnb_model_y, lowess=True, ax=axes[0, 1],
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
sns.residplot(airbnb_model_x.iloc[:,2],airbnb_model_y, lowess=True, ax=axes[1, 0], 
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
sns.residplot(airbnb_model_x.iloc[:,3],airbnb_model_y, lowess=True, ax=axes[1, 1], 
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
sns.residplot(airbnb_model_x.iloc[:,4],airbnb_model_y, lowess=True, ax=axes[2, 0], 
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
sns.residplot(airbnb_model_x.iloc[:,5],airbnb_model_y, lowess=True, ax=axes[2, 1], 
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
sns.residplot(airbnb_model_x.iloc[:,6],airbnb_model_y, lowess=True, ax=axes[3, 0], 
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
sns.residplot(airbnb_model_x.iloc[:,7],airbnb_model_y, lowess=True, ax=axes[3, 1], 
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
sns.residplot(airbnb_model_x.iloc[:,8],airbnb_model_y, lowess=True, ax=axes[4, 0], 
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
sns.residplot(airbnb_model_x.iloc[:,9],airbnb_model_y, lowess=True, ax=axes[4, 1], 
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.setp(axes, yticks=[])
plt.tight_layout()
C:\Users\Vaibhavi\Anaconda3\lib\site-packages\numpy\lib\function_base.py:3405: RuntimeWarning:

Invalid value encountered in median

C:\Users\Vaibhavi\Anaconda3\lib\site-packages\statsmodels\nonparametric\smoothers_lowess.py:165: RuntimeWarning:

invalid value encountered in greater_equal

Multicollinearity

Multicollinearity will help to measure the relationship between explanatory variables in multiple regression. If there is multicollinearity occurs, these highly related input variables should be eliminated from the model.

In this kernel, multicollinearity will be control with Eigen vector values results.

In [32]:
multicollinearity, V=np.linalg.eig(corr)
multicollinearity
Out[32]:
array([1.94766095, 1.64337523, 1.41516454, 1.26383356, 0.32595472,
       0.46300457, 0.66853039, 0.70054096, 0.76213034, 0.93539909,
       0.87440567])

None one of the eigenvalues of the correlation matrix is close to zero. It means that there is no multicollinearity exists in the data.

First, Standard Scaler technique will be used to normalize the data set. Thus, each feature has 0 mean and 1 standard deviation.

In [33]:
scaler = StandardScaler()
airbnb_model_x = scaler.fit_transform(airbnb_model_x)

Secondly, data will be split in a 70–30 ratio

In [34]:
X_train, X_test, y_train, y_test = train_test_split(airbnb_model_x, airbnb_model_y, test_size=0.3,random_state=42)

Now it is time to build a feature importance graph. For this Extra Trees Classifier method will be used. In the below code, lowess=True makes sure the lowest regression line is drawn.

In [35]:
lab_enc = preprocessing.LabelEncoder()

feature_model = ExtraTreesClassifier(n_estimators=50)
feature_model.fit(X_train,lab_enc.fit_transform(y_train))

plt.figure(figsize=(7,7))
feat_importances = pd.Series(feature_model.feature_importances_, index=airbnb_model.iloc[:,:-1].columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

The above graph shows the feature importance of dataset. According to that, neighborhood group and room type have the lowest importance on the model. Under this result, the model building will be made in 2 phases. In the first phase, models will be built within all features and in the second phase, models will be built without neighborhood group and room type features.

3. Model Building

Phase 1 - With All Features

Correlation matrix, Residual Plots and Multicollinearity results show that underfitting occurs on the model and there is no multicollinearity on the independent variables. Avoiding underfitting will be made with Polynomial Transformation since no new features can not be added or replaced with the existing ones.

In model building section, Linear Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression models will be built. These models will be used to avoiding plain Linear Regression and show the results with a little of regularization.

First, GridSearchCV algorithm will be used to find the best parameters and tuning hyperparameters for each model. In this algorithm 5-Fold Cross Validation and Mean Squared Error Regression Loss metrics will be used.

In [36]:
def linear_reg(input_x, input_y, cv=5):
    ## Defining parameters
    model_LR= LinearRegression()

    parameters = {'fit_intercept':[True,False], 'normalize':[True,False], 'copy_X':[True, False]}

    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.

    grid_search_LR = GridSearchCV(estimator=model_LR,  
                         param_grid=parameters,
                         scoring='neg_mean_squared_error',
                         cv=cv,
                         n_jobs=-1)

    ## Lastly, finding the best parameters.

    grid_search_LR.fit(input_x, input_y)
    best_parameters_LR = grid_search_LR.best_params_  
    best_score_LR = grid_search_LR.best_score_ 
    print(best_parameters_LR)
    print(best_score_LR)
In [37]:
def ridge_reg(input_x, input_y, cv=5):
    ## Defining parameters
    model_Ridge= Ridge()

    # prepare a range of alpha values to test
    alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
    normalizes= ([True,False])

    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.

    grid_search_Ridge = GridSearchCV(estimator=model_Ridge,  
                         param_grid=(dict(alpha=alphas, normalize= normalizes)),
                         scoring='neg_mean_squared_error',
                         cv=cv,
                         n_jobs=-1)

    ## Lastly, finding the best parameters.

    grid_search_Ridge.fit(input_x, input_y)
    best_parameters_Ridge = grid_search_Ridge.best_params_  
    best_score_Ridge = grid_search_Ridge.best_score_ 
    print(best_parameters_Ridge)
    print(best_score_Ridge)
In [38]:
def lasso_reg(input_x, input_y, cv=5):
    ## Defining parameters
    model_Lasso= Lasso()

    # prepare a range of alpha values to test
    alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
    normalizes= ([True,False])

    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.

    grid_search_lasso = GridSearchCV(estimator=model_Lasso,  
                         param_grid=(dict(alpha=alphas, normalize= normalizes)),
                         scoring='neg_mean_squared_error',
                         cv=cv,
                         n_jobs=-1)

    ## Lastly, finding the best parameters.

    grid_search_lasso.fit(input_x, input_y)
    best_parameters_lasso = grid_search_lasso.best_params_  
    best_score_lasso = grid_search_lasso.best_score_ 
    print(best_parameters_lasso)
    print(best_score_lasso)
In [39]:
def elastic_reg(input_x, input_y,cv=5):
    ## Defining parameters
    model_grid_Elastic= ElasticNet()

    # prepare a range of alpha values to test
    alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
    normalizes= ([True,False])

    ## Building Grid Search algorithm with cross-validation and Mean Squared Error score.

    grid_search_elastic = GridSearchCV(estimator=model_grid_Elastic,  
                         param_grid=(dict(alpha=alphas, normalize= normalizes)),
                         scoring='neg_mean_squared_error',
                         cv=cv,
                         n_jobs=-1)

    ## Lastly, finding the best parameters.

    grid_search_elastic.fit(input_x, input_y)
    best_parameters_elastic = grid_search_elastic.best_params_  
    best_score_elastic = grid_search_elastic.best_score_ 
    print(best_parameters_elastic)
    print(best_score_elastic)

K-Fold Cross Validation

Before model building, 5-Fold Cross Validation will be implemented for validation.

In [40]:
kfold_cv=KFold(n_splits=5, random_state=42, shuffle=False)
for train_index, test_index in kfold_cv.split(airbnb_model_x,airbnb_model_y):
    X_train, X_test = airbnb_model_x[train_index], airbnb_model_x[test_index]
    y_train, y_test = airbnb_model_y[train_index], airbnb_model_y[test_index]
C:\Users\Vaibhavi\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:296: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

Polynomial Transformation

The polynomial transformation will be made with a second degree which adding the square of each feature.

In [41]:
Poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train = Poly.fit_transform(X_train)
X_test = Poly.fit_transform(X_test)

Model Prediction

In [42]:
##Linear Regression
lr = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr.fit(X_train, y_train)
lr_pred= lr.predict(X_test)


#Ridge Model
ridge_model = Ridge(alpha = 0.01, normalize = True)
ridge_model.fit(X_train, y_train)             
pred_ridge = ridge_model.predict(X_test) 

#Lasso Model
Lasso_model = Lasso(alpha = 0.001, normalize =False)
Lasso_model.fit(X_train, y_train)
pred_Lasso = Lasso_model.predict(X_test) 

#ElasticNet Model
model_enet = ElasticNet(alpha = 0.01, normalize=False)
model_enet.fit(X_train, y_train) 
pred_test_enet= model_enet.predict(X_test)

Phase 2 - Without All Features

All steps from Phase 1, will be repeated in this Phase. The difference is, neighbourhood_group and room_type features will be eliminated.

In [43]:
airbnb_model_xx = airbnb_model.drop(columns=['neighbourhood_group', 'room_type'])
In [44]:
airbnb_model_xx, airbnb_model_yx = airbnb_model_xx.iloc[:,:-1], airbnb_model_xx.iloc[:,-1]
X_train_x, X_test_x, y_train_x, y_test_x = train_test_split(airbnb_model_xx, airbnb_model_yx, test_size=0.3,random_state=42)
In [45]:
scaler = StandardScaler()
airbnb_model_xx = scaler.fit_transform(airbnb_model_xx)
In [46]:
kfold_cv=KFold(n_splits=4, random_state=42, shuffle=False)
for train_index, test_index in kfold_cv.split(airbnb_model_xx,airbnb_model_yx):
    X_train_x, X_test_x = airbnb_model_xx[train_index], airbnb_model_xx[test_index]
    y_train_x, y_test_x = airbnb_model_yx[train_index], airbnb_model_yx[test_index]
C:\Users\Vaibhavi\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:296: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

In [47]:
Poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_x = Poly.fit_transform(X_train_x)
X_test_x = Poly.fit_transform(X_test_x)
In [48]:
###Linear Regression
lr_x = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr_x.fit(X_train_x, y_train_x)
lr_pred_x= lr_x.predict(X_test_x)

###Ridge
ridge_x = Ridge(alpha = 0.01, normalize = True)
ridge_x.fit(X_train_x, y_train_x)           
pred_ridge_x = ridge_x.predict(X_test_x) 

###Lasso
Lasso_x = Lasso(alpha = 0.001, normalize =False)
Lasso_x.fit(X_train_x, y_train_x)
pred_Lasso_x = Lasso_x.predict(X_test_x) 

##ElasticNet
model_enet_x = ElasticNet(alpha = 0.01, normalize=False)
model_enet_x.fit(X_train_x, y_train_x) 
pred_train_enet_x= model_enet_x.predict(X_train_x)
pred_test_enet_x= model_enet_x.predict(X_test_x)

4. Model Comparison

In this part, 3 metrics will be calculated for evaluating predictions.

  • Mean Absolute Error (MAE) shows the difference between predictions and actual values.

  • Root Mean Square Error (RMSE) shows how accurately the model predicts the response.

  • R^2 will be calculated to find the goodness of fit measure.
  • In [49]:
    print('-------------Linear Regression-----------')
    
    print('--Phase-1--')
    print('MAE: %f'% mean_absolute_error(y_test, lr_pred))
    print('RMSE: %f'% np.sqrt(mean_squared_error(y_test, lr_pred)))   
    print('R2 %f' % r2_score(y_test, lr_pred))
    
    print('--Phase-2--')
    print('MAE: %f'% mean_absolute_error(y_test_x, lr_pred_x))
    print('RMSE: %f'% np.sqrt(mean_squared_error(y_test_x, lr_pred_x)))   
    print('R2 %f' % r2_score(y_test_x, lr_pred_x))
    
    print('---------------Ridge ---------------------')
    
    print('--Phase-1--')
    print('MAE: %f'% mean_absolute_error(y_test, pred_ridge))
    print('RMSE: %f'% np.sqrt(mean_squared_error(y_test, pred_ridge)))   
    print('R2 %f' % r2_score(y_test, pred_ridge))
    
    print('--Phase-2--')
    print('MAE: %f'% mean_absolute_error(y_test_x, pred_ridge_x))
    print('RMSE: %f'% np.sqrt(mean_squared_error(y_test_x, pred_ridge_x)))   
    print('R2 %f' % r2_score(y_test_x, pred_ridge_x))
    
    print('---------------Lasso-----------------------')
    
    print('--Phase-1--')
    print('MAE: %f' % mean_absolute_error(y_test, pred_Lasso))
    print('RMSE: %f' % np.sqrt(mean_squared_error(y_test, pred_Lasso)))
    print('R2 %f' % r2_score(y_test, pred_Lasso))
    
    print('--Phase-2--')
    print('MAE: %f' % mean_absolute_error(y_test_x, pred_Lasso_x))
    print('RMSE: %f' % np.sqrt(mean_squared_error(y_test_x, pred_Lasso_x)))
    print('R2 %f' % r2_score(y_test_x, pred_Lasso_x))
    
    print('---------------ElasticNet-------------------')
    
    print('--Phase-1 --')
    print('MAE: %f' % mean_absolute_error(y_test,pred_test_enet)) #RMSE
    print('RMSE: %f' % np.sqrt(mean_squared_error(y_test,pred_test_enet))) #RMSE
    print('R2 %f' % r2_score(y_test, pred_test_enet))
    
    print('--Phase-2--')
    print('MAE: %f' % mean_absolute_error(y_test_x,pred_test_enet_x)) #RMSE
    print('RMSE: %f' % np.sqrt(mean_squared_error(y_test_x,pred_test_enet_x))) #RMSE
    print('R2 %f' % r2_score(y_test_x, pred_test_enet_x))
    
    -------------Linear Regression-----------
    --Phase-1--
    MAE: 0.377923
    RMSE: 0.522021
    R2 0.527663
    --Phase-2--
    MAE: 0.531963
    RMSE: 0.685894
    R2 0.184227
    ---------------Ridge ---------------------
    --Phase-1--
    MAE: 0.377915
    RMSE: 0.522038
    R2 0.527631
    --Phase-2--
    MAE: 0.529255
    RMSE: 0.679340
    R2 0.199742
    ---------------Lasso-----------------------
    --Phase-1--
    MAE: 0.375922
    RMSE: 0.520400
    R2 0.530591
    --Phase-2--
    MAE: 0.523562
    RMSE: 0.671290
    R2 0.218595
    ---------------ElasticNet-------------------
    --Phase-1 --
    MAE: 0.371707
    RMSE: 0.518862
    R2 0.533362
    --Phase-2--
    MAE: 0.524883
    RMSE: 0.670878
    R2 0.219553
    

    The results show that all models have similar prediction results. Phase 1 and 2 have a great difference for each metric. All metric values are increased in Phase 2 it means, the prediction error value is higher in that Phase and model explainability are very low the variability of the response data around mean.

  • The MAE value of 0 indicates no error on the model. In other words, there is a perfect prediction. The above results show that all predictions have great error especially in phase 2.
  • RMSE gives an idea of how much error the system typically makes in its predictions. The above results show that all models with each phase have significant errors.
  • R2 represents the proportion of the variance for a dependent variable that's explained by an independent variable. The above results show that, in phase 1, 52% of data fit the regression model while in phase 2, 20% of data fit the regression model.
  • In [50]:
    from sklearn.model_selection import train_test_split, cross_val_score
    
    def rmse_cv(model):
        kf = KFold(5, shuffle=True, random_state = 91).get_n_splits(airbnb.price)
        return cross_val_score(model, X_train, y_train, scoring='neg_mean_squared_error', cv=kf)
    
    In [51]:
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    
    best_random = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=30,
                          max_features='sqrt', max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=5,
                          min_weight_fraction_leaf=0.0, n_estimators=1400,
                          n_jobs=None, oob_score=False, random_state=42, verbose=0,
                          warm_start=False)
    
    rfr_CV_best = -rmse_cv(best_random)
    best_random.fit(X_train, y_train) 
    y_train_rfr = best_random.predict(X_train)
    y_test_rfr = best_random.predict(X_test)
    rfr_best_results = pd.DataFrame({'algorithm':['Random Forest Regressor'],
                'CV error': rfr_CV_best.mean(), 
                'CV std': rfr_CV_best.std(),
                'training error': [mean_squared_error(y_train, y_train_rfr)],
                'test error': [mean_squared_error(y_test, y_test_rfr)],
                'training_r2_score': [r2_score(y_train, y_train_rfr)],
                'test_r2_score': [r2_score(y_test, y_test_rfr)]})
    rfr_best_results
    
    Out[51]:
    algorithm CV error CV std training error test error training_r2_score test_r2_score
    0 Random Forest Regressor 0.20 0.01 0.04 0.24 0.91 0.59

    Conclusion:

    Summarizing our findings

    This Airbnb ('AB_NYC_2019') dataset for the 2019 year appeared to be a very rich dataset with a variety of columns that allowed us to do deep data exploration on each significant column presented.
    By creating a map which shows adjusted price of each and every listing, we saw how the pricing was distributed for each and every listing over the New York.Also, how listings are distributed according to borough, number of listings belonging to each borough, how pricing was distributed among each and every borough was examined.
    From this, we saw the mean price of listings for each and every borough which will help the customer to not overpay in specific area and not get fooled by the hosts.
    A model was fitted fitted to predict price v/s every feature and price v/s all features expect neighbourhood_group, room_type. We saw that price dependent on each and every feature as the error for them was less than phase 2.
    Also, The testing R-squared was around 60% which suggested the prediction model selected was good enough to predict the price.
    Finally ElasticNet was the best model among others to predict the price.

    Future Work

    For our data exploration purposes, it also would be nice to have couple additional features, such as positive and negative numeric (0-5 stars) reviews or 0-5 star average review for each listing; addition of these features would help to determine the best-reviewed hosts for NYC along with 'number_of_review' column that is provided.
    If ratings were provided by the customers for each and every listing, depending on that, a recommendation system could be generated for them which will help to find the best listings according to their needs.